The "Bank Marketing" dataset originates from the UCI Machine Learning Repository and contains data on direct marketing campaigns (phone calls) conducted by a Portuguese banking institution. These campaigns aim to promote financial products or services to clients.
The dataset provides insights into various aspects of these campaigns, including client demographics, communication details, and the outcomes of previous marketing efforts. The goal is to understand patterns and factors that influence clients' decisions to subscribe to a term deposit, ultimately aiding in optimizing future marketing strategies.
Further information about the dataset can be found here: https://archive.ics.uci.edu/dataset/222/bank+marketing
Input Variables - Bank Client Data:
Related with the Last Contact of the Current Campaign:
Other Attributes:
Output Variable - Desired Target:
A Term Deposit, also referred to as a Time Deposit, is a financial product provided by banks and financial institutions. It entails depositing a designated sum of money for a predefined period, often referred to as the "term" or "tenure." Throughout this tenure, the deposited funds accumulate interest at a fixed rate, typically higher than the interest offered by standard savings or deposit accounts.
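As a toy numeric illustration (the amount, rate, and term below are hypothetical, not from the dataset), the payoff of a fixed-rate term deposit with annual compounding can be sketched as:

```python
# Hypothetical example: 10,000 euros at a fixed 3% annual rate for a 2-year term.
principal = 10_000
annual_rate = 0.03
term_years = 2

# With annual compounding, maturity value = principal * (1 + rate) ** years.
maturity_value = principal * (1 + annual_rate) ** term_years
interest_earned = maturity_value - principal

print(f"maturity value:  {maturity_value:.2f}")   # 10609.00
print(f"interest earned: {interest_earned:.2f}")  # 609.00
```

The fixed rate over the locked-in term is what distinguishes this product from an ordinary savings account.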
Our objective in this analysis is to thoroughly examine and predict the likelihood of customers subscribing to a term deposit. This entails investigating the factors and attributes that influence a customer's decision to subscribe, utilizing data-driven methodologies to develop predictive models, and ultimately making informed predictions based on customer characteristics and behavior.
Understanding Data Distribution:
Categorical Feature Analysis:
Time Series Analysis:
Correlation Analysis:
Imbalance Check:
Data Preprocessing:
Feature Selection:
Model Selection:
Hyperparameter Tuning:
Model Evaluation:
Feature Importance:
Model Deployment:
import pandas as pd
import numpy as np
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='ticks')
%config InlineBackend.figure_format = 'retina'
plt.rcParams["axes.spines.right"] = False
plt.rcParams["axes.spines.top"] = False
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from xgboost import plot_importance
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, classification_report
from sklearn.metrics import roc_curve, roc_auc_score
original_df = pd.read_csv('bank+marketing/bank/bank-full.csv', delimiter=';')
print(original_df.shape)
original_df.head()
(45211, 17)
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |
df = original_df.copy()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   age        45211 non-null  int64
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64
 12  campaign   45211 non-null  int64
 13  pdays      45211 non-null  int64
 14  previous   45211 non-null  int64
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB
df.isnull().sum().to_frame(name='missing').T
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Note:
Dataset Overview:
Data Composition:
Target Variable:
The y column indicates term deposit subscription ('yes') or non-subscription ('no') and serves as the label for classification tasks.
Data Quality:
This dataset appears well-prepared for analysis and predictive modeling tasks.
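One caveat: columns such as job, education, contact, and poutcome use the string 'unknown' as a placeholder, which isnull() does not count as missing. A minimal sketch of how such placeholders could be tallied (shown on a toy frame, not the real file):

```python
import pandas as pd

# Toy frame mimicking the dataset's use of the 'unknown' placeholder.
toy = pd.DataFrame({
    'job': ['management', 'unknown', 'technician'],
    'contact': ['unknown', 'cellular', 'unknown'],
})

# Count 'unknown' entries per column; on the real df this would reveal
# the de facto missing values hidden in job, education, contact, and poutcome.
unknown_counts = (toy == 'unknown').sum()
print(unknown_counts)
```

Whether to keep 'unknown' as its own category or treat it as missing is a modeling choice; this notebook keeps it as a category.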
Transform month into numerical data:
df['month'].unique()
array(['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'jan', 'feb',
'mar', 'apr', 'sep'], dtype=object)
month_mapping = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}
df['month'] = df['month'].map(month_mapping)
numeric_cols = [col for col in df.columns if df[col].dtype != object]
object_cols = [col for col in df.columns if df[col].dtype == object]
print(f"{len(numeric_cols)} numerical features: \n {numeric_cols}")
print()
print(f"{len(object_cols)} categorical features: \n {object_cols}")
8 numerical features:
 ['age', 'balance', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous']

9 categorical features:
 ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome', 'y']
df.describe()
| | age | balance | day | month | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|---|
| count | 45211.00 | 45211.00 | 45211.00 | 45211.00 | 45211.00 | 45211.00 | 45211.00 | 45211.00 |
| mean | 40.94 | 1362.27 | 15.81 | 6.14 | 258.16 | 2.76 | 40.20 | 0.58 |
| std | 10.62 | 3044.77 | 8.32 | 2.41 | 257.53 | 3.10 | 100.13 | 2.30 |
| min | 18.00 | -8019.00 | 1.00 | 1.00 | 0.00 | 1.00 | -1.00 | 0.00 |
| 25% | 33.00 | 72.00 | 8.00 | 5.00 | 103.00 | 1.00 | -1.00 | 0.00 |
| 50% | 39.00 | 448.00 | 16.00 | 6.00 | 180.00 | 2.00 | -1.00 | 0.00 |
| 75% | 48.00 | 1428.00 | 21.00 | 8.00 | 319.00 | 3.00 | -1.00 | 0.00 |
| max | 95.00 | 102127.00 | 31.00 | 12.00 | 4918.00 | 63.00 | 871.00 | 275.00 |
Note:
From the descriptive statistics, we can draw the following insights about the numeric variables in the dataset:
age: The age of clients ranges from 18 to 95 years, with the majority falling within the 33 to 48 age range. The average age of clients who participated in the campaign is approximately 41 years.
balance: The balance in clients' accounts exhibits significant variation, ranging from a minimum of -8019 euros to a maximum of 102127 euros. While there are clients with substantial balances, the majority have account balances under 1400 euros. Notably, this variable includes negative balances, indicating financial liabilities. The distribution of this variable is right-skewed, as evidenced by the difference between the mean and median (significant max value contributes to this skewness).
day: The last contact day of the month during the campaign has an average value of approximately 16. The day varies between 1 and 31, indicating a fairly uniform distribution across the month.
month: the month column is not very informative in describe(); calendar months are cyclical categories, so their mean and quartiles have little direct meaning.
duration: The average duration of the last contact with clients during the campaign is about 258.16 seconds. The duration varies widely, ranging from short interactions of 0 seconds to lengthy conversations of up to 4918 seconds. Longer contact durations suggest more involved conversations or interactions.
campaign: On average, clients were contacted approximately 2.76 times during this campaign. The number of contacts per client ranges from 1 to 63, with the majority of clients receiving 1 to 3 contacts. There are some cases with higher contact frequencies.
pdays: The average number of days since the client was last contacted from a previous campaign is approximately 40.20 days. A value of -1 indicates that the client was not contacted previously. Interestingly, some clients have been contacted as long as 871 days after a previous campaign, suggesting varying time intervals between campaigns.
previous: The average number of contacts performed before this campaign and for this client is around 0.58. Most clients had no previous contacts, but there are instances where clients were contacted up to 275 times before this campaign, indicating higher engagement levels.
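Because -1 in pdays is a sentinel for "not previously contacted" rather than a real day count, one possible treatment (a sketch on toy values; this notebook keeps the raw column) is to split it into a contact flag plus a masked numeric column:

```python
import numpy as np
import pandas as pd

# Toy pdays values: -1 marks clients never contacted in a previous campaign.
toy = pd.DataFrame({'pdays': [-1, 10, -1, 300, 45]})

# Flag the sentinel separately, and mask it out of the numeric column so
# that summary statistics are computed over genuine day counts only.
toy['was_contacted_before'] = (toy['pdays'] != -1).astype(int)
toy['pdays_clean'] = toy['pdays'].where(toy['pdays'] != -1, np.nan)

print(toy['was_contacted_before'].tolist())  # [0, 1, 0, 1, 1]
print(toy['pdays_clean'].mean())             # mean over real values only
```

Leaving -1 in place inflates neither the flag nor the cleaned mean, whereas averaging the raw column mixes a code with a measurement.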
_, axes = plt.subplots(4, 4, figsize=(16, 16))
for i, col in enumerate(numeric_cols[:4]):
    sns.histplot(data=df, x=col, ax=axes[0, i])
    axes[0, i].set_xlabel('')
    axes[0, i].set_title(f'Histogram of {col}')
    if i != 0:
        axes[0, i].set_ylabel('')
    sns.boxplot(data=df, x=col, ax=axes[1, i])
    axes[1, i].set_xlabel('')
    axes[1, i].set_title(f'Boxplot of {col}')
for i, col in enumerate(numeric_cols[4:]):
    sns.histplot(data=df, x=col, ax=axes[2, i])
    axes[2, i].set_xlabel('')
    axes[2, i].set_title(f'Histogram of {col}')
    if i != 2:
        axes[2, i].set_ylabel('')
    sns.boxplot(data=df, x=col, ax=axes[3, i])
    axes[3, i].set_xlabel('')
    axes[3, i].set_title(f'Boxplot of {col}')
plt.tight_layout()
Note:
age, duration, balance, campaign, pdays, and previous are significantly right-skewed, with a substantial number of outliers toward the higher values.
day and month appear relatively normal, with a higher frequency of last contacts around the middle of the month and of the year, respectively.
balance, duration, campaign, pdays, and previous exhibit small interquartile ranges (IQRs), suggesting that most customers' values are concentrated around the median; the numerous outliers account for the wide overall ranges.
These observations highlight the skewness and outliers in the numeric variables, which will be important to keep in mind during further analysis and modeling.
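The right-skew visible in the plots can also be quantified: pandas' .skew() returns the sample skewness, where positive values indicate a long right tail. A sketch on synthetic stand-ins (not the real columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-ins: an exponential (right-skewed, like balance/duration)
# and a uniform (roughly symmetric, like day).
toy = pd.DataFrame({
    'skewed': rng.exponential(scale=1000, size=10_000),
    'symmetric': rng.uniform(1, 31, size=10_000),
})

skews = toy.skew()
print(skews)
```

On the real data, df[numeric_cols].skew() would flag balance, duration, campaign, pdays, and previous with large positive values in the same way.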
df.describe(include='object')
| | job | marital | education | default | housing | loan | contact | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|
| count | 45211 | 45211 | 45211 | 45211 | 45211 | 45211 | 45211 | 45211 | 45211 |
| unique | 12 | 3 | 4 | 2 | 2 | 2 | 3 | 4 | 2 |
| top | blue-collar | married | secondary | no | yes | no | cellular | unknown | no |
| freq | 9732 | 27214 | 23202 | 44396 | 25130 | 37967 | 29285 | 36959 | 39922 |
for col in object_cols:
    unique_values = df[col].unique()
    if len(unique_values) > 5:
        print(f"There are {len(unique_values)} unique values in column '{col}'.")
    else:
        print(f"There are {len(unique_values)} unique values in column '{col}': {unique_values}")
There are 12 unique values in column 'job'.
There are 3 unique values in column 'marital': ['married' 'single' 'divorced']
There are 4 unique values in column 'education': ['tertiary' 'secondary' 'unknown' 'primary']
There are 2 unique values in column 'default': ['no' 'yes']
There are 2 unique values in column 'housing': ['yes' 'no']
There are 2 unique values in column 'loan': ['no' 'yes']
There are 3 unique values in column 'contact': ['unknown' 'cellular' 'telephone']
There are 4 unique values in column 'poutcome': ['unknown' 'failure' 'other' 'success']
There are 2 unique values in column 'y': ['no' 'yes']
Leave out the variable job (12 unique values); it is plotted separately below:
categorical_cols = [
'marital',
'education',
'default',
'housing',
'loan',
'contact',
'poutcome',
'y'
]
_, axes = plt.subplots(2, 4, figsize=(16, 8))
for i, col in enumerate(categorical_cols):
    sns.countplot(data=df, x=col, palette='mako', ax=axes[i // 4, i % 4])
    axes[i // 4, i % 4].set_title(f"Count of {col}")
    axes[i // 4, i % 4].set_xlabel('')
    axes[i // 4, i % 4].set_ylabel('')
plt.tight_layout();
plt.figure(figsize=(16, 4))
order = df['job'].value_counts().index
sns.countplot(data=df, x='job', palette='mako', order=order)
plt.title("Count of job")
plt.xlabel('')
plt.ylabel('')
plt.tight_layout();
Note:
In the dataset, it appears that categorical variables may exert a more significant influence on our final target variable compared to numerical data. To gain a deeper understanding of the impact of these variables, I will conduct further analysis in conjunction with the variable y at a later stage.
time_series = df.groupby(['month','day'])[['y']].value_counts().reset_index()
time_series_pivot = time_series.pivot_table(index=['month', 'day'], columns='y', values=0, fill_value=0)
time_series_pivot = time_series_pivot.reset_index()
time_series_pivot
| | month | day | no | yes |
|---|---|---|---|---|
| 0 | 1 | 6 | 2 | 0 |
| 1 | 1 | 7 | 3 | 1 |
| 2 | 1 | 8 | 3 | 2 |
| 3 | 1 | 11 | 5 | 7 |
| 4 | 1 | 12 | 9 | 13 |
| ... | ... | ... | ... | ... |
| 313 | 12 | 27 | 1 | 0 |
| 314 | 12 | 28 | 7 | 9 |
| 315 | 12 | 29 | 6 | 7 |
| 316 | 12 | 30 | 3 | 1 |
| 317 | 12 | 31 | 1 | 0 |
318 rows × 4 columns
Only 318 of the 365 possible (month, day) combinations are recorded.
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.lineplot(data=time_series_pivot, x='day', y='yes', marker='o', label='Yes')
#sns.lineplot(data=time_series_pivot, x='day', y='no', marker='o', label='No')
plt.ylabel('Count')
plt.title('Distribution of Term Deposit by Day')
plt.legend()
plt.subplot(1, 2, 2)
sns.lineplot(data=time_series_pivot, x='month', y='yes', marker='o', label='Yes')
#sns.lineplot(data=time_series_pivot, x='month', y='no', marker='o', label='No')
plt.ylabel('Count')
plt.title('Distribution of Term Deposit by Month')
plt.legend()
plt.tight_layout()
Note:
The time series analysis reveals intriguing patterns regarding the relationship between subscription to term deposits and specific time intervals.
Middle of the Month and Year: Subscriptions to the term deposit exhibit a distinctive trend during certain time frames. Specifically, a higher proportion of clients tend to subscribe to the term deposit during the middle of the month (from the 10th to the 20th) and the middle of the year (spanning from April to August). This suggests the presence of seasonal or cyclical factors that influence clients' decisions to subscribe during these particular periods.
End and Start of the Month: Additionally, the 'day' chart highlights a sudden surge in subscriptions at the end and beginning of the month. This could indicate an interesting pattern related to financial or budgetary cycles for clients.
Variability in Other Time Periods: On the other hand, subscription rates are comparatively lower during other time intervals, indicating potential variations in client behavior and preferences.
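One caveat for the charts above: they plot raw counts of 'yes', which conflate seasonal preference with contact volume. A subscription proportion per month separates the two; a minimal sketch on made-up counts shaped like time_series_pivot:

```python
import pandas as pd

# Toy monthly outcome counts in the same shape as the pivot above
# (values are invented for illustration).
toy = pd.DataFrame({
    'month': [4, 5, 6],
    'no':    [200, 900, 400],
    'yes':   [60, 90, 80],
})

# Subscription rate = yes / (yes + no), aggregated per month.
monthly = toy.groupby('month')[['no', 'yes']].sum()
monthly['rate'] = monthly['yes'] / (monthly['no'] + monthly['yes'])
print(monthly['rate'].round(3))
```

In this toy example, May-like months with heavy contact volume can show the largest 'yes' count yet the lowest conversion rate, which is exactly the distinction the count plots hide.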
plt.figure(figsize=(7,5))
corr = df.corr(numeric_only=True)
sns.heatmap(
corr,
mask=np.triu(corr),
annot=True, fmt='.2f',
cbar_kws={'shrink':0.5},
# vmin=-1, vmax=1, center=0,
linewidth=0.1,
linecolor='w',
annot_kws={'size': 9},
cmap=sns.diverging_palette(220, 20, as_cmap=True)
)
plt.title('Correlation Heatmap');
Note:
The heatmap reveals only modest correlations among the numeric variables, all close to 0. Note that y itself is excluded here because it is still a string and corr(numeric_only=True) drops it; when the target is mapped to 0/1, duration exhibits the highest correlation with it.
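Because corr(numeric_only=True) silently drops string columns, including the target requires mapping y to 0/1 first; the resulting Pearson correlation with a binary variable is the point-biserial correlation. A minimal sketch on toy numbers (not the real data):

```python
import pandas as pd

# Toy data: longer calls tend to end in a subscription.
toy = pd.DataFrame({
    'duration': [50, 100, 150, 600, 800, 900],
    'y': ['no', 'no', 'no', 'yes', 'no', 'yes'],
})

# Map the target to 0/1 so it participates in the correlation matrix.
toy['y_num'] = toy['y'].map({'no': 0, 'yes': 1})
corr_with_target = toy[['duration', 'y_num']].corr().loc['duration', 'y_num']
print(round(corr_with_target, 3))
```

On the real df the same pattern, df.assign(y_num=df['y'].map({'no': 0, 'yes': 1})).corr(numeric_only=True), would add a target row to the heatmap.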
sns.pairplot(df, corner=True, hue='y');
_, axes = plt.subplots(2, 4, figsize=(15, 8))
for i, col in enumerate(numeric_cols):
    sns.boxplot(data=df, y='y', x=col, ax=axes[i // 4, i % 4], palette='mako')
    if i == 0 or i == 4:
        axes[i // 4, i % 4].set_ylabel("term deposit")
    else:
        axes[i // 4, i % 4].set_ylabel("")
        axes[i // 4, i % 4].set_yticks([])
Note:
Upon initial examination, the box plots depicting numeric features segregated by term deposit status appear relatively similar. However, there are distinct variations in the distributions of duration, month, and pdays. These variables appear to have a substantial influence on the likelihood of successful subscriptions, and further investigation is warranted to understand their significance in more detail.
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome']
num_features = len(categorical_features)
fig, axes = plt.subplots(num_features, 2, figsize=(15, 32))
for i, col in enumerate(categorical_features):
    order = df[col].value_counts().index
    sns.countplot(data=df, x=col, hue='y', ax=axes[i, 0], palette='mako', order=order)
    axes[i, 0].set_title(f'Count of {col} by term deposit')
    axes[i, 0].set_xlabel('')
    axes[i, 0].set_ylabel('')
    axes[i, 0].legend(title='term deposit', labels=['No', 'Yes'])
    axes[i, 0].set_xticklabels(axes[i, 0].get_xticklabels(), rotation=45, horizontalalignment='right')
    df['y_numeric'] = df['y'].map({'no': 0, 'yes': 1})
    probability = df.groupby(col)['y_numeric'].mean().sort_values(ascending=False)
    sns.barplot(x=probability.index, y=probability.values, ax=axes[i, 1], palette='magma', order=order)
    axes[i, 1].set_title(f'Probability of term deposit by {col}')
    axes[i, 1].set_ylabel('')
    axes[i, 1].set_xlabel('')
    axes[i, 1].set_xticklabels(axes[i, 1].get_xticklabels(), rotation=45, horizontalalignment='right')
plt.tight_layout()
df = df.drop(columns=['y_numeric'])
Note:
Upon examining the right-hand side of the visualization, it becomes evident that the probabilities for clients subscribing to a term deposit based on different features are relatively low, ranging from around 0.1 to 0.2 overall.
Comparing the charts reveals several intriguing insights:
Job: While 'students' and 'retired' are among the least common job categories, they exhibit the highest probabilities of subscribing to a term deposit, approximately 0.3 and 0.25, respectively. On the other hand, despite 'blue-collar', 'management', and 'technician' representing the highest job counts, their probabilities are relatively similar to other professions.
Marital Status: Although individuals classified as 'married' exhibit the lowest subscription probability, they constitute the largest count in both subscribing and not subscribing groups. 'Single' individuals seem more promising both in terms of rate and count. However, 'divorced' individuals, while having a low count, show a higher subscription rate than 'married', suggesting that the bank should carefully consider targeting strategies for the next campaigns.
Education: 'Tertiary' education levels have the highest subscription likelihood, even though the count is only about half of those with 'secondary' education. The 'unknown' category is misleading due to its small count, but it boasts a subscription rate as high as 'tertiary'.
Default: The charts underscore that individuals with no default history are more prone to subscribing to a term deposit compared to those with a default history.
Housing: Similar to 'default', clients without housing loans subscribe at a significantly higher rate than those with housing loans, even though the latter group is larger in count.
Loan: Similar to 'default', clients without personal loans are more likely to subscribe.
Contact Method: Clients contacted via 'cellular' communication are clearly more likely to subscribe than those whose contact method is 'unknown'. Although clients contacted via 'telephone' are the least numerous, their conversion rate is about as high as that of clients reached through 'cellular'.
Previous Campaign Outcome (poutcome): Despite being the least frequent category, clients whose previous campaign was a success exhibit roughly a 60% likelihood of subscribing to a term deposit, a clear follow-up opportunity. Most customers, however, have an 'unknown' previous outcome, and that group's conversion rate is quite low.
age Factor
sns.histplot(data=df, x='age', hue='y')
<Axes: xlabel='age', ylabel='Count'>
balance Factor
The feature balance holds significant importance in predicting whether a client will subscribe to a term deposit or not. This variable plays a crucial role in influencing the subscription outcome. Understanding the distribution and patterns within the balance feature can provide valuable insights into customer behavior and their likelihood to subscribe.
I will analyze the variations in balance across different categories and its relationship with the target variable:
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan']
num_rows = len(categorical_features) // 2 + len(categorical_features) % 2
fig, axes = plt.subplots(num_rows, 2, figsize=(15, 5*num_rows))
for i, feature in enumerate(categorical_features):
    row = i // 2
    col = i % 2
    ax = axes[row, col]
    order = df[feature].value_counts().index
    sns.boxplot(data=df, x=feature, y='balance', hue='y', ax=ax, palette='mako', order=order)
    ax.set_title(f'Box Plot of balance by {feature}')
    ax.set_xlabel('')
    if i % 2 == 1:
        ax.set_ylabel('')
    ax.legend(title='term deposit')
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.tight_layout();
Note:
For default, housing, and loan, individuals without a default and without loans tend to have higher account balances. This group is also more inclined to subscribe to the term deposit, indicating a potential relationship between financial stability and subscription likelihood.
plt.figure(figsize=(10,5))
mean_balance = df['balance'].mean()
# First subplot: Bar plot of balance by term deposit
plt.subplot(1, 2, 1)
sns.barplot(data=df, x='y', y='balance', palette='mako')
plt.axhline(mean_balance, label='Mean Balance', linestyle='--', color='navy')
plt.legend()
plt.title('Mean Balance by Term Deposit');
# Second subplot: Histogram plot of balance with hue='y'
plt.subplot(1, 2, 2)
sns.histplot(data=df, x='balance', hue='y', multiple='stack', bins=range(0,6000,200), palette='mako')
plt.axvline(mean_balance, label='Mean Balance', linestyle='--', color='navy')
plt.title('Distribution of Balance by Term Deposit')
plt.text(1500, 8000, f"mean balance: {mean_balance:.2f}")
plt.tight_layout();
df[df['balance'] > mean_balance]['y'].value_counts(normalize=True)
no     0.84
yes    0.16
Name: y, dtype: float64
df[df['balance'] < mean_balance]['y'].value_counts(normalize=True)
no     0.90
yes    0.10
Name: y, dtype: float64
Note:
Clients with an above-mean balance subscribe at a noticeably higher rate (about 16%) than those below it (about 10%), so balance does carry some signal for the target.
duration Factor
duration_mean = df['duration'].mean()
sns.histplot(data=df, x='duration', hue='y', bins=range(0,1400,25), multiple='stack', palette='mako')
plt.axvline(duration_mean, linestyle='--', label='Mean')
plt.text(300, 3600, f'mean duration: {duration_mean:.2f}');
Note:
The distribution of duration concerning term deposits reveals intriguing patterns. When examining the distribution among clients who responded with 'no,' a right-skewed curve is evident. This implies that a substantial number of clients who decline the term deposit tend to do so around the median duration value.
On the other hand, for clients who responded with 'yes,' the distribution is also right-skewed, but its tail extends much further. As call duration increases, so does the likelihood of subscription: clients engaged in conversations lasting over 700 seconds are markedly more inclined to subscribe.
While the histogram for 'no' responses experiences a sharp decline, the histogram for 'yes' responses remains relatively stable. This implies that the bank can effectively target individuals who engage in conversations exceeding the average duration.
In summary, these insights underscore the noteworthy influence of call duration on the decision to subscribe. Prolonged conversations tend to correlate with a higher likelihood of a positive outcome.
Term Deposit Rate with duration above the mean value:
df[df['duration'] > duration_mean]['y'].value_counts(normalize=True).to_frame(name='deposit rate')
| | deposit rate |
|---|---|
| no | 0.75 |
| yes | 0.25 |
Term Deposit Rate with duration below the mean value:
df[df['duration'] < duration_mean]['y'].value_counts(normalize=True).to_frame(name='deposit rate')
| | deposit rate |
|---|---|
| no | 0.95 |
| yes | 0.05 |
Term Deposit Rate with duration above 700 seconds:
df[df['duration'] > 700]['y'].value_counts(normalize=True).to_frame(name='deposit rate')
| | deposit rate |
|---|---|
| yes | 0.53 |
| no | 0.47 |
Note:
The subscription rate for term deposits varies sharply with call duration: about 25% for calls above the mean duration, only 5% for calls below it, and 53% for calls longer than 700 seconds.
In other words, for every 1000 customers the bank contacts, roughly 250 subscribe when calls exceed the average duration, only about 50 when calls fall short of it, and around 530 when conversations run past 700 seconds.
These figures emphasize the substantial influence of call duration on subscription outcomes and underline the importance of efficiently managing call interactions to optimize subscription rates.
df['y'].value_counts(normalize=True).to_frame()
| | y |
|---|---|
| no | 0.88 |
| yes | 0.12 |
The dataset exhibits a notable class imbalance in the target variable 'y': roughly 88% of clients did not subscribe, versus only 12% who did.
This class imbalance is a common scenario in real-world datasets, and it presents a challenge for building predictive models. However, in this context, straightforward approaches like downsampling the majority class or upsampling the minority class might not be the best solution. The reason lies in the fact that the conversion rate of clients subscribing to the term deposit is inherently low.
In cases where the positive outcome is rare and the conversion rate is low, aggressive resampling methods can potentially introduce noise into the data and lead to overfitting. Therefore, a more nuanced strategy is required. Instead of focusing solely on balancing the class distribution, the emphasis should be on developing models that can accurately capture the patterns within this imbalanced dataset.
Efforts should be directed towards feature engineering, model selection, and performance evaluation metrics that are tailored to the business problem at hand.
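One such model-side alternative to resampling is cost-sensitive learning via class_weight, which the decision-tree grid search later in this notebook also explores. A minimal sketch on synthetic data mirroring the 88/12 imbalance (not the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with roughly an 88/12 class ratio.
X, y = make_classification(n_samples=5000, weights=[0.88, 0.12], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' reweights each class inversely to its frequency,
# so minority-class errors cost more during training; the data itself is untouched.
plain = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)
weighted = DecisionTreeClassifier(max_depth=5, class_weight='balanced',
                                  random_state=42).fit(X_tr, y_tr)

r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"recall, plain tree:   {r_plain:.3f}")
print(f"recall, class_weight: {r_weighted:.3f}")
```

Weighting typically trades some precision for minority-class recall, which is why F1 (not accuracy) is used as the refit metric later on.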
Let's dive into these preprocessing steps to prepare the data for building and evaluating our classification model.
df_model = df.copy()
binary_cols = ['default', 'housing', 'loan', 'y']
for col in binary_cols:
    df_model[col] = np.where(df_model[col] == 'yes', 1, 0)
other_categorical_cols = ['job', 'marital', 'education', 'contact', 'poutcome']
for col in other_categorical_cols:
    df_model = pd.get_dummies(df_model, columns=[col], prefix=[col])
X = df_model.drop(columns=['y']).copy()
y = df_model['y'].copy()
features = X.columns.to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.25,
stratify=y,
random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_scaled.shape, X_test_scaled.shape, y_train.shape, y_test.shape
((33908, 37), (11303, 37), (33908,), (11303,))
Now that preprocessing is done, let's build classification models for the task:
To establish a solid foundation, I will create three essential functions that will be utilized iteratively throughout the model building:
The evaluate_model function will display various scoring metrics to assess model performance.
The conf_matrix_plot function will provide a clear visualization of the confusion matrix.
The plot_top_features function will visualize the top feature importances.
These functions will serve as valuable tools in evaluating and presenting the results of our analysis and modeling efforts.
def evaluate_model(model_name, model_object, x_train, y_train, x_test, y_test, cv=True):
    if cv:
        # Training for GridSearchCV: read the best estimator's CV scores
        cv_results = pd.DataFrame(model_object.cv_results_)
        best_idx = cv_results['mean_test_f1'].idxmax()
        best_estimator_results = cv_results.iloc[best_idx, :]
        f1_train = best_estimator_results['mean_test_f1']
        recall_train = best_estimator_results['mean_test_recall']
        precision_train = best_estimator_results['mean_test_precision']
        accuracy_train = best_estimator_results['mean_test_accuracy']
    else:
        # Training for a single model
        model_object.fit(x_train, y_train)
        y_train_preds = model_object.predict(x_train)
        f1_train = f1_score(y_train, y_train_preds)
        recall_train = recall_score(y_train, y_train_preds)
        precision_train = precision_score(y_train, y_train_preds)
        accuracy_train = accuracy_score(y_train, y_train_preds)
    # Testing
    y_test_preds = model_object.predict(x_test)
    f1_test = f1_score(y_test, y_test_preds)
    recall_test = recall_score(y_test, y_test_preds)
    precision_test = precision_score(y_test, y_test_preds)
    accuracy_test = accuracy_score(y_test, y_test_preds)
    # Store results in DataFrames
    train_scores = pd.DataFrame({
        'Model': [model_name],
        'F1': [f1_train],
        'Recall': [recall_train],
        'Precision': [precision_train],
        'Accuracy': [accuracy_train],
        'Data': ['Train']
    })
    test_scores = pd.DataFrame({
        'Model': [model_name],
        'F1': [f1_test],
        'Recall': [recall_test],
        'Precision': [precision_test],
        'Accuracy': [accuracy_test],
        'Data': ['Test']
    })
    combined_scores = pd.concat([train_scores, test_scores], ignore_index=True)
    return combined_scores
def conf_matrix_plot(model_name, model_object, x_data, y_data):
    fig, ax = plt.subplots(figsize=(4, 4))
    model_pred = model_object.predict(x_data)
    cm = confusion_matrix(y_data, model_pred, labels=model_object.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model_object.classes_)
    # Draw on the existing axes; otherwise disp.plot() opens a second,
    # empty figure and the figsize and title land on the wrong one.
    disp.plot(cmap='Blues', ax=ax)
    ax.set_title(f"{model_name} Model confusion matrix");
def plot_top_features(feature_importances, model_name, feature_names, n=10):
    importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
    importance_df = importance_df.sort_values(by='Importance', ascending=False)
    top_features = importance_df.head(n)
    plt.figure(figsize=(8, 4))
    sns.barplot(data=top_features, x='Importance', y='Feature', palette='mako')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.title(f'Top {n} Features of model {model_name}')
    sns.despine();
baseline = DummyClassifier(strategy='stratified', random_state=42)
baseline.fit(X_train_scaled, y_train)
DummyClassifier(random_state=42, strategy='stratified')
baseline_result = evaluate_model('Baseline Model', baseline, X_train_scaled, y_train, X_test_scaled, y_test, cv=False)
baseline_result
| | Model | F1 | Recall | Precision | Accuracy | Data |
|---|---|---|---|---|---|---|
| 0 | Baseline Model | 0.12 | 0.12 | 0.12 | 0.79 | Train |
| 1 | Baseline Model | 0.11 | 0.11 | 0.11 | 0.79 | Test |
plt.rcParams["axes.spines.right"] = True
plt.rcParams["axes.spines.top"] = True
conf_matrix_plot('Baseline', baseline, X_test_scaled, y_test)
During this session, I will employ a Decision Tree as a white-box model. This choice aims to enhance our comprehension of the underlying mechanisms. Additionally, I will utilize Random Forest and XGBoost as black-box models, leveraging their complexity for optimal performance and outcomes. This approach is geared towards achieving the most favorable results across different model types.
Decision Tree:
Random Forest:
XGBoost:
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train_scaled, y_train)
DecisionTreeClassifier(random_state=42)
tree_result = evaluate_model('Tree', tree, X_train_scaled, y_train, X_test_scaled, y_test, cv=False)
tree_result
| | Model | F1 | Recall | Precision | Accuracy | Data |
|---|---|---|---|---|---|---|
| 0 | Tree | 1.00 | 1.00 | 1.00 | 1.00 | Train |
| 1 | Tree | 0.48 | 0.47 | 0.48 | 0.88 | Test |
conf_matrix_plot('Tree', tree, X_test_scaled, y_test)
The unoptimized decision tree clearly overfits the training data (perfect train scores against much weaker test scores), underscoring the need for hyperparameter tuning. Even so, the decision tree already beats our baseline by a wide margin.
Now, I aim to fine-tune the model's parameters, mitigating the overfitting issue and achieving improved generalization performance on unseen data:
tree_params = {
'criterion': ['gini', 'entropy'],
'max_depth': [None, 5, 10, 15, 20, 25],
'min_samples_split': [2, 5, 10, 15, 20, 25],
'min_samples_leaf': [1, 2, 4, 6, 8, 10, 15, 20],
'class_weight': [None, 'balanced', {0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7}, {0: 0.4, 1: 0.6}]
}
scoring = ['f1', 'recall', 'precision', 'accuracy']
tree_cv = GridSearchCV(
tree,
tree_params,
scoring=scoring,
cv=5,
refit='f1',
n_jobs=-1,
verbose=True
)
%%time
tree_cv.fit(X_train_scaled, y_train)
Fitting 5 folds for each of 2880 candidates, totalling 14400 fits
CPU times: total: 1min 17s
Wall time: 9min 20s
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=42), n_jobs=-1,
             param_grid={'class_weight': [None, 'balanced', {0: 0.2, 1: 0.8},
                                          {0: 0.3, 1: 0.7}, {0: 0.4, 1: 0.6}],
                         'criterion': ['gini', 'entropy'],
                         'max_depth': [None, 5, 10, 15, 20, 25],
                         'min_samples_leaf': [1, 2, 4, 6, 8, 10, 15, 20],
                         'min_samples_split': [2, 5, 10, 15, 20, 25]},
             refit='f1', scoring=['f1', 'recall', 'precision', 'accuracy'],
             verbose=True)
tree_cv.best_params_
{'class_weight': {0: 0.3, 1: 0.7},
'criterion': 'entropy',
'max_depth': 10,
'min_samples_leaf': 20,
'min_samples_split': 2}
After thorough hyperparameter tuning, we have identified the optimal parameters for our decision tree model:
- class_weight: {0: 0.3, 1: 0.7}
- criterion: 'entropy'
- max_depth: 10
- min_samples_leaf: 20
- min_samples_split: 2

These parameters were selected by the grid search as the combination with the highest cross-validated F1 score for predicting whether a client subscribes to a term deposit, capturing the underlying patterns in the data while mitigating the overfitting seen in the untuned tree.
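One detail worth noting: because the grid search was created with `refit='f1'`, the best configuration is automatically retrained on the full training set, so the tuned model is available via `best_estimator_` without any manual refitting. A self-contained sketch on synthetic data (the notebook uses `X_train_scaled` / `y_train` instead):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

cv = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {'max_depth': [2, 4]},
    scoring=['f1', 'accuracy'],
    refit='f1',      # refit the best-F1 params on all of X after the search
    cv=3,
)
cv.fit(X, y)
best = cv.best_estimator_        # already fitted; no extra .fit() needed
print(cv.best_params_)           # chosen configuration
print(cv.best_score_)            # mean cross-validated F1 of that configuration
preds = best.predict(X)
```

This is why `tree_cv` can be passed straight to `evaluate_model` below: predictions delegate to the refitted best estimator.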
tree_tuned_result = evaluate_model('Tree Tuned CV', tree_cv, X_train_scaled, y_train, X_test_scaled, y_test, cv=True)
tree_tuned_result
| Model | F1 | Recall | Precision | Accuracy | Data | |
|---|---|---|---|---|---|---|
| 0 | Tree Tuned CV | 0.58 | 0.70 | 0.50 | 0.88 | Train |
| 1 | Tree Tuned CV | 0.58 | 0.67 | 0.50 | 0.88 | Test |
conf_matrix_plot('Tree Tuned CV', tree_cv, X_test_scaled, y_test)
plot_top_features(tree_cv.best_estimator_.feature_importances_, 'Tree Tuned', features, n=15)
plt.figure(figsize=(20,10))
plot_tree(
tree_cv.best_estimator_,
max_depth=3,
filled=True,
rounded=True,
class_names=['No', 'Yes'],
feature_names = features,
proportion=True,
fontsize=10
);
rf = RandomForestClassifier(n_estimators=50, random_state=42)
rf.fit(X_train_scaled, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)
forest_result = evaluate_model('Forest', rf, X_train_scaled, y_train, X_test_scaled, y_test, cv=False)
forest_result
| Model | F1 | Recall | Precision | Accuracy | Data | |
|---|---|---|---|---|---|---|
| 0 | Forest | 1.00 | 1.00 | 1.00 | 1.00 | Train |
| 1 | Forest | 0.47 | 0.37 | 0.67 | 0.90 | Test |
conf_matrix_plot('Random Forest', rf, X_test_scaled, y_test)
forest_params = {
'n_estimators': [100, 150, 200],
'max_depth': [None, 8, 15],
'min_samples_split': [5, 10, 15],
'min_samples_leaf': [1, 5, 10],
'max_features':['sqrt', 0.5, 0.8],
'class_weight': [None, 'balanced', {0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7}, {0: 0.4, 1: 0.6}]
}
scoring = ['accuracy', 'precision', 'recall', 'f1']
rf_cv = GridSearchCV(
rf,
forest_params,
scoring=scoring,
cv=5,
refit='f1',
n_jobs=-1,
verbose=True
)
%%time
rf_cv.fit(X_train_scaled, y_train)
Fitting 5 folds for each of 1215 candidates, totalling 6075 fits
CPU times: total: 1min 5s
Wall time: 2h 46min 13s
GridSearchCV(cv=5,
             estimator=RandomForestClassifier(n_estimators=50, random_state=42),
             n_jobs=-1,
             param_grid={'class_weight': [None, 'balanced', {0: 0.2, 1: 0.8},
                                          {0: 0.3, 1: 0.7}, {0: 0.4, 1: 0.6}],
                         'max_depth': [None, 8, 15],
                         'max_features': ['sqrt', 0.5, 0.8],
                         'min_samples_leaf': [1, 5, 10],
                         'min_samples_split': [5, 10, 15],
                         'n_estimators': [100, 150, 200]},
             refit='f1', scoring=['accuracy', 'precision', 'recall', 'f1'],
             verbose=True)
rf_cv.best_params_
{'class_weight': {0: 0.2, 1: 0.8},
'max_depth': None,
'max_features': 0.5,
'min_samples_leaf': 5,
'min_samples_split': 15,
'n_estimators': 150}
After extensive hyperparameter tuning, the Random Forest model yielded the best results with the following set of hyperparameters:
- class_weight: {0: 0.2, 1: 0.8}
- max_depth: None
- max_features: 0.5
- min_samples_leaf: 5
- min_samples_split: 15
- n_estimators: 150

forest_tuned_result = evaluate_model('Forest Tuned CV', rf_cv, X_train_scaled, y_train, X_test_scaled, y_test, cv=True)
forest_tuned_result
| Model | F1 | Recall | Precision | Accuracy | Data | |
|---|---|---|---|---|---|---|
| 0 | Forest Tuned CV | 0.62 | 0.71 | 0.56 | 0.90 | Train |
| 1 | Forest Tuned CV | 0.62 | 0.71 | 0.56 | 0.90 | Test |
conf_matrix_plot('Random Forest Tuned CV', rf_cv, X_test_scaled, y_test)
plot_top_features(rf_cv.best_estimator_.feature_importances_, 'Random Forest Tuned', features, n=15)
xgb = XGBClassifier(objective='binary:logistic', random_state=42)
xgb.fit(X_train_scaled, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=42, ...)
xgb_result = evaluate_model('XGBoost', xgb, X_train_scaled, y_train, X_test_scaled, y_test, cv=False)
xgb_result
| Model | F1 | Recall | Precision | Accuracy | Data | |
|---|---|---|---|---|---|---|
| 0 | XGBoost | 0.80 | 0.72 | 0.88 | 0.96 | Train |
| 1 | XGBoost | 0.52 | 0.45 | 0.62 | 0.90 | Test |
conf_matrix_plot('XGBoost', xgb, X_test_scaled, y_test)
xgb_params = {
'max_depth': [None, 5, 10, 15],
'min_child_weight': [3, 5, 7, 10],
'learning_rate': [0.05, 0.1, 0.2, 0.3],
'n_estimators': [150, 200, 300],
}
scoring = ['accuracy', 'precision', 'recall', 'f1']
xgb_cv = GridSearchCV(
xgb,
xgb_params,
scoring=scoring,
cv=5,
refit='f1',
n_jobs=-1,
verbose=True
)
%%time
xgb_cv.fit(X_train_scaled, y_train)
Fitting 5 folds for each of 192 candidates, totalling 960 fits
CPU times: total: 42.4 s
Wall time: 33min 11s
GridSearchCV(cv=5,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     feature_types=None, gamma=None,
                                     gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None,...
                                     max_leaves=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, n_jobs=None,
                                     num_parallel_tree=None, predictor=None,
                                     random_state=42, ...),
             n_jobs=-1,
             param_grid={'learning_rate': [0.05, 0.1, 0.2, 0.3],
                         'max_depth': [None, 5, 10, 15],
                         'min_child_weight': [3, 5, 7, 10],
                         'n_estimators': [150, 200, 300]},
             refit='f1', scoring=['accuracy', 'precision', 'recall', 'f1'],
             verbose=True)
xgb_cv.best_params_
{'learning_rate': 0.2,
'max_depth': None,
'min_child_weight': 10,
'n_estimators': 200}
Optimized hyperparameters:
- learning_rate: 0.2
- max_depth: None
- min_child_weight: 10
- n_estimators: 200

xgb_tuned_result = evaluate_model('XGBoost Tuned CV', xgb_cv, X_train_scaled, y_train, X_test_scaled, y_test, cv=True)
xgb_tuned_result
| Model | F1 | Recall | Precision | Accuracy | Data | |
|---|---|---|---|---|---|---|
| 0 | XGBoost Tuned CV | 0.56 | 0.50 | 0.65 | 0.91 | Train |
| 1 | XGBoost Tuned CV | 0.54 | 0.46 | 0.63 | 0.91 | Test |
conf_matrix_plot('XGBoost Tuned CV', xgb_cv, X_test_scaled, y_test)
plot_top_features(xgb_cv.best_estimator_.feature_importances_, 'XGBoost Tuned', features, n=15)
final_results = pd.concat(
[
baseline_result,
tree_result,
tree_tuned_result,
forest_result,
forest_tuned_result,
xgb_result,
xgb_tuned_result
], axis=0
).reset_index(drop=True)
final_results
| Model | F1 | Recall | Precision | Accuracy | Data | |
|---|---|---|---|---|---|---|
| 0 | Baseline Model | 0.12 | 0.12 | 0.12 | 0.79 | Train |
| 1 | Baseline Model | 0.11 | 0.11 | 0.11 | 0.79 | Test |
| 2 | Tree | 1.00 | 1.00 | 1.00 | 1.00 | Train |
| 3 | Tree | 0.48 | 0.47 | 0.48 | 0.88 | Test |
| 4 | Tree Tuned CV | 0.58 | 0.70 | 0.50 | 0.88 | Train |
| 5 | Tree Tuned CV | 0.58 | 0.67 | 0.50 | 0.88 | Test |
| 6 | Forest | 1.00 | 1.00 | 1.00 | 1.00 | Train |
| 7 | Forest | 0.47 | 0.37 | 0.67 | 0.90 | Test |
| 8 | Forest Tuned CV | 0.62 | 0.71 | 0.56 | 0.90 | Train |
| 9 | Forest Tuned CV | 0.62 | 0.71 | 0.56 | 0.90 | Test |
| 10 | XGBoost | 0.80 | 0.72 | 0.88 | 0.96 | Train |
| 11 | XGBoost | 0.52 | 0.45 | 0.62 | 0.90 | Test |
| 12 | XGBoost Tuned CV | 0.56 | 0.50 | 0.65 | 0.91 | Train |
| 13 | XGBoost Tuned CV | 0.54 | 0.46 | 0.63 | 0.91 | Test |
models = [baseline, tree, tree_cv, rf, rf_cv, xgb, xgb_cv]
model_names = ['Baseline', 'Decision Tree', 'Tuned Decision Tree', 'Random Forest', 'Tuned Random Forest', 'XGBoost', 'Tuned XGBoost']
plt.figure(figsize=(8, 6))
for model, name in zip(models, model_names):
y_pred_probs = model.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)
auc_score = roc_auc_score(y_test, y_pred_probs)
plt.plot(fpr, tpr, label=f'{name} (AUC = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend();
Note:
In our analysis, we are confronted with the classic Recall-Precision Tradeoff: as recall increases, precision tends to decrease, and vice versa. The tradeoff shows up clearly in our two best-performing models:

Forest Tuned CV: This model achieves a high recall of 71%, indicating its capacity to identify a large share of potential subscribers. The tradeoff is a precision of 56%, meaning it also classifies some non-subscribers as potential subscribers.
XGBoost Tuned CV: This model, on the other hand, presents a higher precision of 63%, suggesting that it is cautious when classifying potential subscribers. However, this is accompanied by a lower recall of 46%, indicating that it might miss some actual potential subscribers in the process.
Here's a snapshot of the performance metrics:
| Model | F1 | Recall | Precision | Accuracy | Data |
|---|---|---|---|---|---|
| Forest Tuned CV | 0.62 | 0.71 | 0.56 | 0.90 | Train |
| Forest Tuned CV | 0.62 | 0.71 | 0.56 | 0.90 | Test |
| XGBoost Tuned CV | 0.54 | 0.46 | 0.63 | 0.91 | Train |
| XGBoost Tuned CV | 0.54 | 0.46 | 0.63 | 0.91 | Test |
Considering this tradeoff, the choice of model depends on the specific goals of the bank:
High Recall with Tolerable Precision: If the bank aims to reach as many potential subscribers as possible and is willing to accept a lower precision, the Forest Tuned CV model might be preferable due to its higher recall.
Balanced Tradeoff: For a balanced approach between recall and precision, where the goal is to identify potential subscribers without compromising overall accuracy, the Forest Tuned CV model is a strong contender.
High Precision with Tolerable Recall: If the bank's main priority is to minimize false positives and achieve a higher precision, even if it means slightly lower recall, the XGBoost Tuned CV model could be the right choice.
Ultimately, the decision should align with the bank's specific objectives and tolerance for false positives and false negatives. It's important to strike a balance between recall and precision that aligns with the bank's marketing and business strategy.
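The same tradeoff can also be navigated within a single model by moving the decision threshold on predicted probabilities (scikit-learn's `predict` uses 0.5 by default). A self-contained sketch on synthetic stand-in data — in the notebook this would use the tuned forest with `X_test_scaled` / `y_test`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank data.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=2000) > 1).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

# Lowering the threshold trades precision for recall; raising it does the opposite.
for threshold in (0.3, 0.5, 0.7):
    preds = (probs >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_te, preds, zero_division=0):.2f}, "
          f"recall={recall_score(y_te, preds):.2f}")
```

In practice, the bank could keep a single model and simply tune this threshold against its tolerance for false positives versus false negatives.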
final_model = RandomForestClassifier(class_weight={0: 0.2, 1: 0.8},
max_depth=None,
max_features=0.5,
min_samples_leaf=5,
min_samples_split=15,
n_estimators=150,
random_state=42)
final_model.fit(X_train_scaled, y_train)
y_pred = final_model.predict(X_test_scaled)
y_pred_probs = final_model.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)
roc_auc = roc_auc_score(y_test, y_pred_probs)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("ROC AUC Score:", roc_auc)
Confusion Matrix:
[[9249 732]
[ 389 933]]
Classification Report:
precision recall f1-score support
0 0.96 0.93 0.94 9981
1 0.56 0.71 0.62 1322
accuracy 0.90 11303
macro avg 0.76 0.82 0.78 11303
weighted avg 0.91 0.90 0.91 11303
ROC AUC Score: 0.9309275369040814
Feature importances mapped back to the original features:
feature_importances = final_model.feature_importances_
features_df = pd.DataFrame({
'Feature': features,
'Importance': feature_importances
})
original_features = original_df.columns.tolist()
features_df['Original Feature'] = features_df['Feature'].apply(lambda x: next((original_feature for original_feature in original_features if x.startswith(original_feature)), None))
final_original_features = features_df.groupby('Original Feature')[['Importance']].sum().sort_values(by='Importance', ascending=False).reset_index()
sns.barplot(data=final_original_features, x='Importance', y='Original Feature', palette='Blues_r')
plt.xlabel('Importance')
plt.ylabel('Original Feature')
plt.title('Feature Importances for Original Features')
plt.tight_layout()
for i, v in enumerate(final_original_features['Importance']):
plt.text(v + 0.005, i, f'{v:.3f}', color='black', va='center', fontsize=8)
sns.despine();
Here are the key insights from the feature importance analysis:
Duration: The duration of the call is the most critical factor influencing a client's decision to subscribe to a term deposit, contributing significantly to the predictive power of the model (Importance: 0.41).
Previous Campaign Outcome (poutcome): The outcome of the previous marketing campaign plays a notable role (Importance: 0.10), indicating that clients who responded positively before are likely to do so again.
Time Factors (month and day): The month and day of contact have considerable importance (Importance: 0.09 and 0.07, respectively). Certain months (April to August, particularly May and June) and specific days (around the middle of the month and the end/beginning of the month) exhibit higher subscription rates.
Age and Balance: While individual attributes such as age (Importance: 0.06) and average yearly balance (balance, Importance: 0.06) have an impact, they are not as influential as the time-related and call duration factors.
Contact Communication Type (contact): The method of contact (cellular, telephone) has a modest effect (Importance: 0.04), suggesting that the communication channel does contribute to the decision-making process.
Previous Campaign Contact Days (pdays): The number of days since the client was last contacted in a previous campaign has some influence (Importance: 0.04).
Housing, Job, and Campaign Count: Variables like housing (housing), job type (job), and the number of contacts made during this campaign (campaign) have minor contributions (Importance: 0.02) to the model's predictive power.
Previous Contacts (previous), Education, Marital Status, Loan, and Default: Factors such as the number of contacts performed before this campaign (previous), education level (education), marital status (marital), having a personal loan (loan), and credit in default (default) have limited impact (Importance: 0.01) on subscription likelihood.
Based on the analysis and feature importance:
Avoid Default and Loan: Focus on clients without credit defaults and personal loans, as these categories have the lowest contribution to subscription success.
Enhance Call Quality and Duration: Implement strategies to improve the quality of calls and increase call duration. Ensuring engaging and informative conversations can significantly improve the likelihood of subscription, with calls lasting around 4 minutes or longer yielding the highest success rates.
Target Previous Positive Respondents: Concentrate efforts on clients who responded positively in the previous campaign, as they are more likely to subscribe again.
Strategic Timing: Plan campaigns to coincide with months that exhibit higher subscription rates (April to August), particularly targeting the middle of the month (around days 10 to 20) and periods at the end and beginning of each month. These temporal patterns offer increased chances of success.
These recommendations offer a strategic framework for optimizing the bank's marketing campaigns and enhancing subscription rates for term deposits.
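Finally, for the "Model Deployment" step from the outline: the fitted model (together with any preprocessing, such as the scaler) would need to be persisted and reloaded at serving time. A minimal sketch using joblib on synthetic stand-in data — the file name and data here are placeholders, not artifacts from the notebook:

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; in the notebook this would be final_model (and the scaler,
# which must be saved alongside so serving-time inputs are transformed identically).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

joblib.dump(model, 'final_model.joblib')       # persist to disk
restored = joblib.load('final_model.joblib')   # reload at serving time
# The restored model must reproduce the original predictions exactly.
assert (restored.predict(X) == model.predict(X)).all()
```

Pairing the model and scaler in a single sklearn `Pipeline` before dumping is a common way to avoid train/serve preprocessing drift.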